This analysis derives a transcriptomic signature that predicts survival for renal carcinoma patients. Using transcriptomic data from NIH Genomic Data Commons (GDC), abnormally expressed genes are derived via differential expression analysis (DEA) of tumor vs. normal associated tissue (NAT). The resulting deferentially expressed genes (DEG) where assessed using a Kaplan-Meier survival analysis & Cox Proportional Hazards (COXPH) model to determine which genes best predict patient outcome. The resulting prognostic signature shows robust predictive value of appx. AUC 0.8 & includes a list of genes with well-known associations to tumor progression, including NF𝛋B, DNA repair, & HIF-1⍺ signaling pathways.
Acronyms
DEA - Differential expression analysis
DEG - Differentially expressed genes
GDC - Genomic Data Commons (Nation Cancer Institute)
NAT - Normal associated tissue (GDC calls this “Solid tissue normal”)
KIRC - Kidney renal clear cell carcinoma
PCA - Principal components analysis
FDR - False discovery rate (adjusted p value)
KM - Kaplan-Meier Analysis
COXPH - Cox Proportional-Hazards model
AUC - Area under the curve; a metric of sensitivity & specificity of a predictive model.
OPTIONAL: Make sure GDC client is installed on your machine. Default setting uses API. https://gdc.cancer.gov/access-data/gdc-data-transfer-tool
OPTIONAL: Input a file path for your downloaded transcriptomic data in the first code chunk below. Total downloaded data are 2.5GB as of 5/16/22.
Run this script (click “Knit” above) (takes ~15min to run on my laptop). If you encounter problems with GDC download, update the TCGAbiolinks package and other BioC packages, as described in “Preliminaries” Section below.
Stage 1 - Assess data structure & design Limma model
1.1) GDC is queried for human Kidney Renal Clear Cell Carcinoma (KIRC) bulk RNA sequencing of tumor and normal associated tissue (NAT) samples.
Source: https://portal.gdc.cancer.gov/projects/TCGA-KIRC
1.2) Clinical metadata are profiled to look for confounders that need to be corrected for in the limma model design. For example, tumor and NAT samples showed differences in their tumor stage profile, meaning tumor stage needed to be corrected for in differential expression analysis comparing tumor & NAT.
1.3) Library sizes were plotted to determine whether VOOM is needed. VOOM is typically used for DEA when library size variance is high. Library size variance is calculated (e.g. determine whether to use VOOM). This is done by first setting up a preliminary design matrix, normalizing the data, and plotting the library distributions & normalization correction factors.
Stage 2 - Determine which genes show abnormal expression in cancer. (DEA; tumor vs NAT)
2.1) Re-run GDC query with a filtered samples list (some samples with small lib size were removed)
2.2) Limma/VOOM were used to obtain DEG (NAT vs. tumor)
2.3) Determine whether tumor stratification is needed. Asking which genes significantly predict patient outcomes should take into consideration whether there are multiple distinct tumor phenotypes. E.g. if there are two+ tumor sub-types, it may be appropriate to create two+ survival model for each subtype. Correlation heat mapping and PCA were used to visualize the number of major sample populations & and assess whether tumor sample stratification might be appropriate (no clear basis for stratification was observed).
Stage 3 - Derivation of a prognostic signature using predictive survival modeling (Kaplan Meier & Cox Prop. Hazards)
3.1) Create a survival model using the top abnormally expressed genes (DEG from NAT vs. tumor DEA), were used to determine which genes would predict survival, yielding a prognostic gene signature.
3.2) Create a Cox Proportional Hazards Model using the prognostic genes identified in the Kaplan-Meier analysis. The primary COXPH output is an AUC value over time as an assessment of the overall predictive value of the prognostic signature genes.
Get Started:
Input the file path to which you want to download the GDC data. GDC KIRC transcriptomic counts will be directed to this folder. Once the directory has been set, run all code: (click “Run” or “Knit” above).
library(rstudioapi)
setwd(dirname(rstudioapi::getActiveDocumentContext()$path)) # setwd to source file location
dir.create("GDCdata")
## Warning in dir.create("GDCdata"): 'GDCdata' already exists
target_directory <- file.path(getwd(), "GDCdata") # Optional: change target path for GDC data (~2.5GB)
library(knitr)
knitr::opts_chunk$set(error = FALSE, cache = TRUE, message = FALSE, warning = FALSE, echo = FALSE, include = TRUE, dpi = 300)
Even if you have the following BioC packages installed, consider updating them, as GDC updates their packages frequently.
| sample_type | n |
|---|---|
| Primary Tumor | 540 |
| Solid Tissue Normal | 72 |
Info on GDC metadata: https://docs.gdc.cancer.gov/Data_Dictionary/viewer/#?view=table-definition-view&id=demographic&anchor=age_at_index
In preparation for the limma differential expression analysis (DEA) comparing tumor and normal associated tissue (NAT), clinical metadata is examined here to ask whether several clinical factors may confound the DEA. For example, is the tumor stage profile or the proportion of males different between tumor & NAT? Such clinical variables need to be corrected for statistically in the limma model.
– Plotting continuous variables –
Assessment: age_at_index, age_at_diagnosis, and tumor size (longest_dimension) are not significantly different between primary solid tumor and solid tissue normal (NAT), & are therefore not incorporated into the limma design matrix.
Examine metadata for sample variables that would confound a comparison of tumor & NAT.
– Plotting discrete variables –
Assessment: ajcc_pathologic_stage is signficantly different between tumor & NAT, and should be corrected for in the final limma model. Difference in sex-ratio between tumor vs. NAT will also be corrected for (even though chi-sq p-val showed differences are not statistically significant), due to male-female genetic distance.
Create preliminary design matrix comparing sample groups “Solid Tissue Normal” (NAT) to “Primary solid Tumor”
Create DGEList object (S4) that contains counts, norm counts & library sizes
Remove low expressed genes
## Formal class 'DGEList' [package "edgeR"] with 1 slot
## ..@ .Data:List of 3
## .. ..$ : int [1:28142, 1:612] 3195 38 1122 490 201 1287 1706 2621 1785 1251 ...
## .. .. ..- attr(*, "dimnames")=List of 2
## .. ..$ :'data.frame': 612 obs. of 72 variables:
## .. ..$ :'data.frame': 28142 obs. of 10 variables:
## ..$ names: chr [1:3] "counts" "samples" "genes"
Here, sample library sizes, normalization factors, and expression distributions are visualized to look for any data abnormalities (i.e. samples w/ poor sequencing or biological coverage). The red dashed line shows the library filtering threshold. The removal of small libraries eliminates ffpe samples, and also eliminates samples with less data.
Visualizing these data also informs the ‘Analysis Stage 2’ decision to use VOOM for the differential expression analysis, which is useful were library size variance is high.
Scale normalization of the RNAseq read counts using TMM (trimmed mean of M component) normalization method.
Plot library normalization factors
Plot library expression distributions
Re-do GDC query, preparation, & normalization using a new filtered sample list (samples with larger library sizes). Then use a Kaplan-Meier analysis and COXPH model to find genes the predict survival.
So far…
A GDC query and download was done to visualize the metadata, identifing potential confounders that should be corrected for in the final voom/limma design matrix below. The library sizes and normalization factors were plotted to find outliers, and a sample list excluding very small libraries was created.
Next…
The GDC query will be set up again using the filtered sample list. Design matrix and normalization factors are computed again to ensure compatibility with the new query.
Voom does a transformation & differential expression calculation. VOOM was chosen over limma-trend because the library sizes are variable between samples (>3-fold differences). The VOOM transformation is applied to the normalized & filtered DGEList object ‘dge’.
In the voom plot below, each point is one gene, with x axis showing the avg expression in log2 cpm reads and y axis is the square-root of the residual stdev of expression in log2 counts per million reads.
Voom transformation uses the experiment design matrix, and produces an EList (S4) object
After VOOM, the usual LIMMA pipelines can be run, starting w fitting a model
Derive & plot top significantly (BH FDR10) differentially expressed genes (DEG) from the limma model.
## .
## non-signif. signif.
## 13735 14060
## .
## non-signif. signif.
## 24165 3630
Tabulate n values for Limma differential expression analysis above.
| sample_type | n |
|---|---|
| Primary Tumor | 511 |
| Solid Tissue Normal | 72 |
Plot top 20 DEG (derived comparing tumor vs. NAT):
A useful prognostic signature will ideally mean a single signature that predicts survival for all or most cases of a defined indication such as KIRC. For clinicians, it is generally impractical to determine which phenotypic subcluster a tumor transcriptome belongs to and then choose a prognostic signature on the basis of such a classification.
Nonetheless, a single “one-size-fits-all” prognostic signature may not be possible if there are multiple, distinct tumor sub-populations within the KIRC classification. To explore this possibility, tumor sample relationships are visualized here to determine whether multiple prognostic signatures may be required.
Assessment of PCA plots below: Although there are clearly two populations–NAT and tumor–the tumor samples do not clearly require stratification. There is one, albeit phenotypically diverse tumor population.
The PCA plots above & scree plot below show that PC1 represents the major component of difference between healthy and tumor tissues. PC’s 1 thru 4 do not show distinct sub-populations within the KIRC tumor samples, although there are a few phenotypic outliers. Based on this observation, a single prognostic signature for all tumor samples will be derived and tested for predictive power below.
Use a Spearman correlation heat map of the VOOM-normalized series matrix to look at the prevalence of outliers and the number of tumor sample populations. If multiple, clearly-distinct tumor populations are present, sample stratification may be needed (i.e. comparing NAT to each stratified tumor population).
A survival model will be constructed below to see whether a reliable prognostic signature can be derived from an un-stratified set of tumor samples.
Abnormally expressed DEG (found by comparing tumor vs. NAT), were used here to assess their predict value toward overall survival. The response variable combines “days_to_last_follow_up” & “days_to_death”, using “vital_status” (death) as a censored variable. The output is a list of genes with prognostic value, and these are used in the subsequent coxph model.
KM Analysis input are tumor samples only, examining genes that are a) in the top 10% of genes terms of high expression variance b) are significantly differentially expressed between tumor and NAT (Benjamini-Hochberg FDR10).
Data filtered prior to KM Analysis:
DEG < FDR10
Top 10% of variance
Tumor samples only
The KM analysis above generates a list of genes was found that significantly predicted survival (BH FDR10) . However, p-values have limited utility as sample numbers increase, so this list is refined below include samples with high fold changes (cutoff was 2^0.3219281 = 25% change).
| ENSG | pval | FDR | FC_log2 |
|---|---|---|---|
| ENSG00000073150.14 | 0.0000 | 0.0000000 | 0.3261780 |
| ENSG00000104313.20 | 0.0000 | 0.0000000 | 0.3308153 |
| ENSG00000119547.6 | 0.0000 | 0.0000000 | 0.3637163 |
| ENSG00000123838.11 | 0.0000 | 0.0000000 | 0.3906049 |
| ENSG00000155269.12 | 0.0000 | 0.0000000 | 0.3768961 |
| ENSG00000179698.14 | 0.0000 | 0.0000000 | 0.4308977 |
| ENSG00000181449.4 | 0.0000 | 0.0000000 | 0.3269393 |
| ENSG00000184363.10 | 0.0000 | 0.0000000 | 0.3531697 |
| ENSG00000198768.11 | 0.0000 | 0.0000000 | 0.3315854 |
| ENSG00000225783.8 | 0.0000 | 0.0000000 | 0.3728919 |
| ENSG00000231748.1 | 0.0000 | 0.0000000 | 0.3495725 |
| ENSG00000236453.5 | 0.0000 | 0.0000000 | 0.3565338 |
| ENSG00000236499.2 | 0.0000 | 0.0000000 | 0.3235660 |
| ENSG00000236591.1 | 0.0000 | 0.0000000 | 0.3382766 |
| ENSG00000267121.6 | 0.0000 | 0.0000000 | 0.4104828 |
| ENSG00000268649.5 | 0.0000 | 0.0000000 | 0.3858131 |
| ENSG00000154277.13 | 0.0001 | 0.0008323 | 0.3575658 |
| ENSG00000175265.17 | 0.0001 | 0.0008323 | 0.3905267 |
| ENSG00000197599.12 | 0.0001 | 0.0008323 | 0.3954686 |
| ENSG00000106327.13 | 0.0002 | 0.0014442 | 0.3486972 |
| ENSG00000169885.10 | 0.0002 | 0.0014442 | 0.4053682 |
| ENSG00000225217.1 | 0.0002 | 0.0014442 | 0.3763401 |
| ENSG00000228727.9 | 0.0002 | 0.0014442 | 0.3584158 |
| ENSG00000229759.1 | 0.0003 | 0.0020096 | 0.3664934 |
| ENSG00000263345.1 | 0.0003 | 0.0020096 | 0.3411169 |
| ENSG00000213658.12 | 0.0004 | 0.0024547 | 0.3672919 |
| ENSG00000236017.8 | 0.0004 | 0.0024547 | 0.3358611 |
| ENSG00000100288.20 | 0.0005 | 0.0028958 | 0.3754193 |
| ENSG00000250067.12 | 0.0007 | 0.0037640 | 0.3474407 |
| ENSG00000253317.1 | 0.0009 | 0.0046162 | 0.3347877 |
| ENSG00000271959.1 | 0.0009 | 0.0046162 | 0.3274670 |
| ENSG00000234537.1 | 0.0013 | 0.0061778 | 0.3463489 |
| ENSG00000205702.11 | 0.0020 | 0.0086068 | 0.3995944 |
| ENSG00000236871.7 | 0.0034 | 0.0130733 | 0.3338875 |
| ENSG00000287089.1 | 0.0039 | 0.0145141 | 0.3581816 |
| ENSG00000271821.1 | 0.0048 | 0.0172850 | 0.3544788 |
| ENSG00000277883.1 | 0.0137 | 0.0393044 | 0.3507425 |
| ENSG00000241769.7 | 0.0188 | 0.0505455 | 0.3679429 |
| ENSG00000284874.1 | 0.0239 | 0.0607887 | 0.3826835 |
In order to further refine the prognostic gene list, and to generate an AUC for the genes that predict survival according to the Kaplan-Meier Analysis, a COXPH model us used here. The genes with the most significant hazard ratios are used to calculate an AUC over time.
Input the Kaplan-Meier genes into a Cox Proportional Hazards (COXPH) model, in order to visualize their hazard ratios and p values in a volcano plot for the entire data set.
Determine AUC using genes with absolute hazard ratio >=10%
https://datascienceplus.com/time-dependent-roc-for-survival-prediction-models-in-r/
Derive an AUC from only significant genes (from COXPH model for all data). One of the 7 genes, IP6K3, is removed here as it is does not have significant predictive power in the more targeted model (6 genes) below. Above, the top genes are filtered above are further refined to a smaller panel that would be feasible in a clinical setting (i.e. 5-10 gene transcripts).
Testing and plotting a AUC over time for the top 7 prognostic genes.
Tabulate coefficients for the 7-gene COXPH model
Kaplan-Meier univariate survival curves showing the top 7 predictive genes from the COXPH model above.
Input here into the new cox model are the top genes from the KM analysis above, but this time, incorporating the three clinical vars to see if they enhance the model.
Tabulate coefficients for the 7-gene COXPH model along with age, gender, & ajcc pathologic stage.
A prognostic signature of six transcripts was derived here. This signature predicts survival with the Chambless and Diao’s estimator of cumulative/dynamic AUC = 0.8 for the COXPH model. This predictive capacity that remains durable over 12 years after index (sample collection time). The addition of clinical variables to the model–including tumor stage & patient age at index–does add moderate predictive capacity to the model, but the advantage is only slight. These prognostic transcripts include transcripts known to contribute to tumor progression, including NF𝛋B signaling (LAT) , HIF-1⍺ signaling (ONECUT2), as well as inositol hexakisphosphate kinase 3 (IP6K3) & glucose-regulated protein 78 (GPR78), both of which have a previously defined negative prognostic value in renal cancer. Prognostic genes also included cytochrome P450, CYP2D7 consistent therapeutic resistance predicting survival. Expression of these genes provide valuable prognostic information for KIRC patients.
References:
ONECUT2 - Accelerates tumor progression in gastric, CRC, prostate, hepatocellular cancers. Androgen resistance and HIF1alpha signaling pathways.
https://www.nature.com/articles/s41467-018-08133-6
IP6K3 - prognostic in renal cancer
https://www.proteinatlas.org/ENSG00000161896-IP6K3
LAT - NFkB signalling; immune signalling; MHC and TCR signaling
https://www.proteinatlas.org/ENSG00000213658-LAT/pathology/renal+cancer
GPR78 - prognostic in renal cancer
see kaplan-meier plot https://www.proteinatlas.org/ENSG00000155269-GPR78/pathology/renal+cancer
LINC00896
https://pubmed.ncbi.nlm.nih.gov/34790382/
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8576217/
SAPCD1 - DNA mismatch repair; prognostic in cancer
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7391198/
CYP2D6 - drug resistance / detoxification https://www.frontiersin.org/articles/10.3389/fphar.2010.00121/full
Long-non-coding RNA’s show prognostic value in KIRC